Skip to content

Conversation

@htahir1
Copy link
Contributor

@htahir1 htahir1 commented Nov 2, 2025

Describe changes

I implemented a new Hugging Face deployer component that allows deploying containerized applications to Hugging Face Spaces. This deployer complements the existing Hugging Face model deployer by providing a more general-purpose deployment option for any containerized application.

The implementation includes:

  • A new HuggingFaceDeployer class that extends ContainerizedDeployer
  • Configuration options for Space hardware, storage, and visibility
  • Support for authentication via token, secret, or environment variable
  • Methods to provision, deprovision, and check deployment status

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Implements a new Huggingface deployer that allows deploying ZenML
pipelines as Docker-based Huggingface Spaces. This deployer extends
ContainerizedDeployer and uses the huggingface_hub API to manage
Space lifecycles.

Key features:
- Create and update Huggingface Spaces with Docker SDK
- Support for hardware tier selection (CPU, GPU options)
- Support for persistent storage tiers
- Automatic Space naming with configurable prefixes
- Full deployment lifecycle management (provision, status, deprovision)

Implementation includes:
- HuggingfaceDeployer: Main deployer class with Space management
- HuggingfaceDeployerFlavor: Flavor registration for the stack
- HuggingfaceDeployerConfig: Configuration including token and defaults
- HuggingfaceDeployerSettings: Per-deployment settings for hardware/storage
- HuggingfaceDeploymentMetadata: Metadata tracking for Space deployments

The deployer automatically generates Dockerfile and README.md for Spaces,
handles authentication via token or environment variables, and provides
proper error handling for Space operations.
Addressed several logical issues found during code review:

1. Fixed environment variable name: Changed from HUGGING_FACE_HUB_TOKEN
   to the correct HF_TOKEN as per huggingface_hub documentation

2. Removed unused 'hardware' variable in _create_readme method

3. Improved SpaceStage status mapping to handle all possible states:
   - Added support for RUNNING_BUILDING, BUILD_ERROR, RUNTIME_ERROR
   - Added support for CONFIG_ERROR, NO_APP_FILE, DELETING
   - Better categorization of error vs pending states

4. Fixed Dockerfile environment variable escaping to properly handle
   backslashes followed by quotes

5. Added safety check for empty deployment names in _sanitize_space_name
   to prevent edge case failures

6. Fixed potential None access in log error message by checking if
   space_url exists before using it

These fixes ensure more robust error handling and better compatibility
with the Huggingface Spaces API.
Removed over-engineered code to make it minimal and maintainable:

- Removed HuggingfaceDeploymentMetadata class, use simple dict instead
- Removed _sanitize_space_name complexity, simplified to basic regex
- Inlined _create_readme and _create_dockerfile methods
- Simplified status mapping from 10+ cases to 3 simple cases
- Removed unnecessary exception wrapping and verbose logging
- Removed unused imports and helper methods

The code is now ~200 lines instead of ~500 lines, much easier to
understand and maintain while keeping all core functionality.
Fixed several bugs found during code review:

1. CRITICAL: Fixed Dockerfile env var escaping - values with quotes or
   backslashes would break the Dockerfile. Now properly escapes both.

2. Fixed empty space name handling - if deployment name only contains
   special characters, now defaults to 'deployment' instead of empty string

3. Added try/except around hardware and storage API calls to prevent
   them from crashing the entire provisioning if they fail

4. Changed flavor name from 'huggingface' to 'huggingface-spaces' to
   avoid confusion with the existing model deployer flavor

These fixes make the deployer more robust and handle edge cases properly.
@htahir1 htahir1 added staging-workspace backend Implementations related to the backend labels Nov 2, 2025
@CLAassistant
Copy link

CLAassistant commented Nov 2, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ htahir1
✅ strickvl
❌ claude
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions bot added the internal To filter out internal PRs and issues label Nov 2, 2025
Aligned the new deployer with existing Hugging Face integration patterns:

1. Naming consistency: Changed 'Huggingface' to 'HuggingFace' (capital F)
   to match existing classes like HuggingFaceModelDeployer

2. Token security: Use SecretField for token to mark it as sensitive,
   following the same pattern as HuggingFaceModelDeployerConfig

3. Secret support: Added secret_name field to fetch tokens from ZenML
   secrets, providing flexibility like the model deployer

4. Stack validation: Added validator to ensure either token or
   secret_name is configured before deployment

5. Token retrieval: Implemented _get_token() method with priority:
   config.token > secret_name > HF_TOKEN env var

All class names, patterns, and conventions now match the existing
Hugging Face integration for consistency and maintainability.
The flavor name should be 'huggingface' not 'huggingface-spaces'.
Model deployer and deployer are different component types so there's
no naming collision.
Fixed all linting issues found by scripts/lint.sh and scripts/docstring.sh:

1. F821 - Added HfApi to TYPE_CHECKING imports to fix undefined name error
2. DAR302 - Removed 'Yields' section from do_get_deployment_state_logs
   since it only raises an exception
3. DAR401 - Added Exception to Raises section in do_deprovision_deployment
   to document the re-raised exception

All linting checks now pass.
- Created comprehensive documentation for the Hugging Face deployer in docs/book/component-guide/deployers/huggingface.md
- Added entry to deployer table of contents (toc.md)
- Updated deployers README.md with Hugging Face deployer information
- Documented important limitation about Docker image accessibility and private registries
- Included configuration examples, settings, and usage instructions
@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2025

Documentation Link Check Results

Absolute links check passed
Relative links check passed
Last checked: 2025-11-05 11:15:33 UTC

## Major Changes

### 1. Added Deployment Server Entrypoint
- Dockerfiles now include proper ENTRYPOINT and CMD instructions
- Uses DeploymentEntrypointConfiguration to generate correct startup command
- Deployment server now starts automatically with deployment ID parameter

### 2. Implemented Two-Mode Deployment System

#### Mode 1: Image Reference (with container registry)
- References pre-built Docker image from container registry
- Lightweight Dockerfile with FROM, ENV, USER, ENTRYPOINT, CMD
- Image must be publicly accessible (documented limitation)

#### Mode 2: Full Build (without container registry)
- Builds complete image from scratch in Hugging Face Spaces
- Uploads source code, requirements.txt, and Dockerfile to Space
- Generates full Dockerfile with dependency installation
- **Solves private registry authentication problem!**

### 3. Implementation Details
- Added _get_entrypoint_and_command() helper method
- Added _generate_image_reference_dockerfile() for Mode 1
- Added _generate_full_build_dockerfile() for Mode 2
- Added _get_requirements_for_deployment() to gather dependencies
- Modified do_provision_deployment() to detect stack configuration and choose mode
- Mode selection based on stack.container_registry presence

### 4. Documentation Updates
- Replaced "Important Limitations" section with "Deployment Modes"
- Documented both deployment modes with use cases
- Added clear workarounds for private registry issues
- Provided examples for stack configuration

## Benefits
- Fixes missing entrypoint issue - deployments can now start properly
- Provides solution for private registry problem via full-build mode
- Maintains backward compatibility for users with public registries
- Automatic mode selection based on stack configuration
Replace custom _generate_full_build_dockerfile implementation with
ZenML's internal PipelineDockerImageBuilder._generate_zenml_pipeline_dockerfile
method to:

- Eliminate code duplication
- Ensure consistency with how ZenML builds Docker images elsewhere
- Automatically benefit from future improvements to internal method
- Reduce maintenance burden

Changes:
- Import PipelineDockerImageBuilder and json module
- Modify _generate_full_build_dockerfile to call internal method
- Create requirements_files in format expected by internal method
- Merge environment/secrets into docker_settings.environment
- Internal method handles: FROM, WORKDIR, ENV, apt packages, requirements, COPY, USER
- Manually append ENTRYPOINT and CMD using json.dumps() for exec form
- Update docstring to document use of internal method

Benefits:
- ~50 lines of code eliminated
- Consistent with ZenML patterns (python_package_installer, installer_args, etc.)
- Proper handling of docker_settings.local_project_install_command
- Correct WORKDIR and permissions setup
Based on testing feedback, made the following critical fixes:

## Code Changes

1. **Fixed ENTRYPOINT/CMD Serialization (Line 217-218)**
   - Changed from str() to json.dumps() for proper Docker exec form
   - Before: str(entrypoint) → ['python', ...] (single quotes, invalid)
   - After: json.dumps(entrypoint) → ["python", ...] (double quotes, valid)
   - Fixes: "/bin/sh: 1: [python,: not found" error

2. **Fixed Default Port (Line 78, 107)**
   - Changed app_port default from 7860 to 8000
   - 7860 is HF Spaces default, but ZenML server runs on 8000
   - Updated both code and documentation

3. **Added Container Registry Requirement to Validator (Lines 102-140)**
   - Stack validator now requires container registry
   - Prevents deployment without pre-built image
   - Clear error message explains requirement

4. **Removed Full Build Mode (Lines 250-376 deleted)**
   - Deleted _generate_full_build_dockerfile method
   - Deleted _get_requirements_for_deployment method
   - Simplified do_provision_deployment to single mode
   - Removed unused imports (source_utils, PipelineDockerImageBuilder)
   - Full build mode caused issues:
     * Uploaded entire codebase (inefficient)
     * Configuration problems
     * Recommended to always use container registry

## Documentation Changes

1. **Updated Port Documentation**
   - Changed default from 7860 to 8000 in docs
   - Added note that 8000 is ZenML server default

2. **Removed Deployment Modes Section**
   - Deleted entire two-mode explanation
   - Simplified to single-mode operation

3. **Added Important Requirements Section**
   - Container Registry Requirement subsection
   - Explains why public access is needed
   - Lists recommended registries (Docker Hub, GHCR)
   - Example setup with GitHub Container Registry

4. **Added X-Frame-Options Configuration Section**
   - Documents iframe embedding issue
   - Provides complete code example
   - Shows DeploymentSettings with SecureHeadersConfig
   - Explains: xfo=False disables X-Frame-Options header
   - Without this, HF Spaces shows blank page

## Summary of Fixes

✅ ENTRYPOINT/CMD serialization (json.dumps)
✅ Default port 7860 → 8000
✅ Container registry now required by validator
✅ Full build mode removed
✅ Documentation updated with requirements
✅ X-Frame-Options configuration documented

These changes address all issues discovered during testing.
Copy link
Contributor

@strickvl strickvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's probably not ideal is that (esp in the context of setting these as public by default) we write the secrets to the Dockerfile which then means your HF tokens (and whatever else) is basically leaked / in public the moment you use our deployer. I think HfApi.add_space_secret(repo_id, key, value) and HfApi.add_space_variable(repo_id, key, value) are the preferred ways of handling secrets / env vars with spaces, which we should use instead of injecting them into the Dockerfile.

Otherwise a bunch of smaller comments below.


* Hugging Face Spaces-specific settings:

* `space_hardware` (default: `None`): Hardware tier for the Space (e.g., `'cpu-basic'`, `'cpu-upgrade'`, `'t4-small'`, `'t4-medium'`, `'a10g-small'`, `'a10g-large'`). If not specified, uses free CPU tier.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe reference this page? https://huggingface.co/docs/hub/spaces-gpus


* `space_hardware` (default: `None`): Hardware tier for the Space (e.g., `'cpu-basic'`, `'cpu-upgrade'`, `'t4-small'`, `'t4-medium'`, `'a10g-small'`, `'a10g-large'`). If not specified, uses free CPU tier.
* `space_storage` (default: `None`): Persistent storage tier for the Space (e.g., `'small'`, `'medium'`, `'large'`). If not specified, no persistent storage is allocated.
* `private` (default: `False`): Whether to create the Space as private. Public Spaces are visible to everyone.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we default to private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@strickvl i dont think so actually... i think most of the time its a demo or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. Personally I'd always rather something defaulted to private and then allowed me to turn it on / set it to public whenever I want. That way at least you avoid having leaked something you didn't want to.

- Source code and dependencies uploadable to Hugging Face

**Use when:**
- You don't want to manage a container registry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

earlier in the docs you said that a remote container registry was required

| [Docker](docker.md) | `docker` | Built-in | Deploys pipelines as locally running Docker containers |
| [GCP Cloud Run](gcp-cloud-run.md) | `gcp` | `gcp` | Deploys pipelines to Google Cloud Run for serverless execution |
| [AWS App Runner](aws-app-runner.md) | `aws` | `aws` | Deploys pipelines to AWS App Runner for serverless execution |
| [Hugging Face](huggingface.md) | `huggingface` | `huggingface` | Deploys pipelines to Hugging Face Spaces as Docker applications |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [Hugging Face](huggingface.md) | `huggingface` | `huggingface` | Deploys pipelines to Hugging Face Spaces as Docker applications |
| [Hugging Face](huggingface.md) | `huggingface` | `huggingface` | Deploys pipelines to Hugging Face Spaces as Docker Spaces |

),
)
except Exception as e:
logger.warning(f"Failed to set hardware {hardware}: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we actually fail here instead of just logging a warning?

The Space ID in format 'username/space-name'.
"""
api = self._get_hf_api()
username = api.whoami()["name"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work for an organization? def works for a normal user, but does it work if you want to deploy to the ZenML org on HF Spaces? Not sure it would.

Comment on lines 300 to 301
api.space_info(space_id)
logger.info(f"Updating existing Space: {space_id}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if someone changes the visibility of the space (from private to public or vice-versa) then it wouldn't actually make those changes with the way the code is currently written, so we should make sure to handle that.

## Security Issue
Previously, environment variables and secrets were written directly into
the Dockerfile using ENV instructions. This meant that:
- ❌ ALL secrets were exposed in the Dockerfile
- ❌ Anyone with access to the Space could view credentials
- ❌ Especially dangerous since Spaces are PUBLIC by default (private: False)
- ❌ Secrets were permanently in the repository history

This was a critical security vulnerability that could leak API keys,
database credentials, and other sensitive information.

## Security Fix

### Code Changes (huggingface_deployer.py)

1. **Updated _generate_image_reference_dockerfile (Lines 217-247)**
   - Removed environment and secrets parameters
   - No longer generates ENV lines in Dockerfile
   - Added security note in docstring
   - Dockerfile now only contains: FROM, USER, ENTRYPOINT, CMD

2. **Updated do_provision_deployment (Lines 312-363)**
   - Changed method call: removed environment, secrets args
   - Added secure environment variable handling (lines 337-350):
     * Uses api.add_space_variable() for each environment variable
     * Variables stored securely by Hugging Face
     * Not exposed in Dockerfile or repository
   - Added secure secrets handling (lines 352-363):
     * Uses api.add_space_secret() for each secret
     * Secrets encrypted and never exposed
     * Safe even with public Spaces

### Documentation Changes (huggingface.md)

Added "Secure Secrets and Environment Variables" section (Lines 158-175):
- Explains security approach with success hint
- Documents use of HF Space Secrets/Variables API
- Clarifies that nothing is written to Dockerfile
- Emphasizes safety with public Spaces (the default)
- Lists security benefits with checkmarks

## Security Improvements

✅ Secrets encrypted and never exposed in repository
✅ Environment variables managed through secure API
✅ No credentials in Dockerfile
✅ Safe to use with public Spaces (default behavior)
✅ No risk of credential leakage even if Space is public
✅ Follows security best practices

## How It Works Now

1. Dockerfile is generated WITHOUT any ENV instructions
2. After uploading Dockerfile, deployer calls:
   - `api.add_space_variable(repo_id, key, value)` for each env var
   - `api.add_space_secret(repo_id, key, value)` for each secret
3. Hugging Face injects these at runtime securely
4. No credentials ever appear in repository files

This approach is the recommended way to handle secrets/env vars with
Hugging Face Spaces according to their API documentation.
- Added link to https://huggingface.co/docs/hub/spaces-gpus for space_hardware setting
- Helps users understand available GPU options and pricing
- Addresses documentation feedback
Implements 8 improvements from code review:

1. Add organization support for deploying to HF organizations
   - Added 'organization' config parameter to HuggingFaceDeployerConfig
   - Updated _get_space_id() to check organization before falling back to username

2. Add space name length validation
   - Added HF_SPACE_NAME_MAX_LENGTH constant (96 chars)
   - Validate space name length and raise DeployerError if exceeded

3. Handle space visibility updates
   - Check if space_info.private != settings.private when updating
   - Call api.update_repo_visibility() to apply visibility changes

4. Fail on invalid hardware/storage instead of warning
   - Changed hardware/storage errors to raise DeploymentProvisionError
   - Added clear error messages with documentation links

5. Use proper error types for 404 detection
   - Import HfHubHTTPError from huggingface_hub.utils
   - Check e.response.status_code == 404 instead of string matching

6. Document timeout parameter in deprovision
   - Added docstring note that timeout is unused (deletion is immediate)

7. Remove unused space_exists variable
   - Removed space_exists assignments to fix lint error

8. Update documentation terminology
   - Changed "Docker applications" to "Docker Spaces" in deployers README

All changes maintain backward compatibility and improve code quality.
The secret_name parameter was redundant since token is already a SecretField
that supports ZenML's secret reference syntax ({{secret.key}}).

Changes:
- Removed secret_name from HuggingFaceDeployerConfig
- Simplified _get_token() to just return config.token or environment variable
- Updated validator to only check for token (not token or secret_name)
- Updated documentation to show proper secret reference syntax: {{hf_token.token}}
- Removed unused Client import (auto-removed by formatter)

This simplifies the API and follows ZenML's standard pattern for secret handling.
Added detailed descriptions to all config fields following ZenML standards:
- Minimum 30 characters
- Action-oriented language
- Concrete examples with realistic values
- Clear format specifications and constraints

Field descriptions added:
- token: Authentication, secret syntax, permissions, example
- organization: Purpose, example URL, permission requirements
- space_hardware: Options with specs, GPU tiers, documentation link
- space_storage: Tiers with sizes, persistence behavior
- space_prefix: Purpose, naming example, length constraint

All descriptions include practical examples and relevant documentation links
to help users configure the deployer correctly.
Changed the default value of `private` parameter from False to True to follow
security best practices. Private by default prevents accidental exposure of
deployment information.

Changes:
- Set private=True as default in HuggingFaceDeployerSettings
- Updated docstring to indicate default is True for security
- Updated documentation to reflect new default value
- Removed redundant private=True from code example (now default)
- Updated security section to mention both private and public Spaces
- Clarified that secure secrets handling protects credentials even in public Spaces

Users can still explicitly set private=False to make Spaces publicly visible,
but the safer default protects users who don't explicitly configure visibility.
@htahir1 htahir1 requested a review from strickvl November 3, 2025 10:34
Implements all requested improvements from second round of code review:

1. Use specific HfHubHTTPError exception instead of broad Exception
   - Check for 404 status code when space_info fails
   - Raise DeploymentProvisionError for other HTTP errors
   - Provides better error messages and debugging

2. Fail early on environment variable failures
   - Changed from logger.warning to raise DeploymentProvisionError
   - Deployment will fail immediately if env vars can't be set
   - Prevents deployment from starting without required configuration

3. Fail early on secret failures
   - Changed from logger.warning to raise DeploymentProvisionError
   - Deployment will fail immediately if secrets can't be set
   - Critical since secrets are needed for ZenML server connection

4. Fix RUNNING_BUILDING state mapping
   - Moved RUNNING_BUILDING from RUNNING to PENDING status
   - Only fully running Spaces return RUNNING status
   - Health endpoint not available during RUNNING_BUILDING

5. Store external state in metadata for debugging
   - Added runtime.stage to DeploymentOperationalState metadata
   - Helps with debugging deployment status issues
   - Provides visibility into HF Space's internal state

6. Use correct exception type in deprovision
   - Changed from DeploymentNotFoundError to DeploymentDeprovisionError
   - Signals backend errors vs successful deletion
   - Updated docstring to document both exception types

7. Document API behavior with clarifying comments
   - add_space_variable/secret are upsert operations (add or update)
   - request_space_hardware/storage replace the current tier
   - Clarifies behavior for redeployments

8. Handle space_id mismatch corner case
   - Detect when space_id changes (renamed deployment/changed prefix)
   - Automatically deprovision old Space before creating new one
   - Prevents orphaned Spaces from accumulating

All changes improve error handling, debuggability, and resource cleanup.
claude and others added 4 commits November 3, 2025 12:33
Addresses feedback about using non-standard deployment states. The previous
implementation introduced HuggingFace-specific stages (RUNNING_BUILDING,
NO_APP_FILE) into ZenML status mapping without properly handling all cases.

Changes:
- Import and use SpaceStage enum instead of string matching
- Map all 10 HuggingFace Space stages to ZenML's 5 standard states:
  * RUNNING → RUNNING (only when fully provisioned)
  * BUILDING, RUNNING_BUILDING → PENDING (health endpoint not available)
  * BUILD_ERROR, RUNTIME_ERROR, CONFIG_ERROR, NO_APP_FILE → ERROR
  * STOPPED, PAUSED, DELETING → ABSENT (exists but not running)
  * Unknown stages → UNKNOWN (future-proofing)

Key fix: RUNNING_BUILDING now correctly maps to PENDING, not RUNNING, because
the health endpoint is not available during this rebuild phase.

Follows same pattern as GCP/AWS deployers which map external service states
to ZenML's standard deployment lifecycle states.
Fixed lint error - DeploymentDeprovisionError was used but not imported.
This exception is used in do_deprovision_deployment when deletion fails.
CRITICAL BUG FIX: The deployer was returning the HuggingFace Space page URL
(https://huggingface.co/spaces/{space_id}) instead of the actual deployment
endpoint URL. This caused the base deployer to continuously poll because the
health check was hitting the HF page instead of the deployment server.

Changes:
- Extract the actual domain from runtime.raw['domains']
- Construct proper deployment URL: https://{domain}
- Only set URL when status is RUNNING (follows GCP/AWS pattern)
- URL is None for non-RUNNING states

Example:
- Before: https://huggingface.co/spaces/zenml/zenml-weather_agent-5917ffec
- After:  https://zenml-zenml-weather_agent-5917ffec.hf.space

This allows the base deployer's health check to succeed and stop polling
once the deployment is fully ready.
htahir1 and others added 3 commits November 3, 2025 13:56
Additional fix for continuous polling issue. The Space can be in RUNNING
stage (Docker container started) but the domain might not be ready yet
(DNS propagating, routing not configured). This caused premature RUNNING
status reports.

Changes:
- Only return DeploymentStatus.RUNNING when BOTH conditions are met:
  1. runtime.stage == SpaceStage.RUNNING
  2. domains[0]['stage'] == "READY"
- Space RUNNING + domain not ready → PENDING status
- Only set deployment URL when domain stage is READY
- Added domain_stage to metadata for debugging

This ensures health checks only run when the domain is actually ready
to receive traffic, not just when the Docker container has started.

Note: If polling continues after domain is READY, it likely means the
FastAPI app inside the container is still initializing. The base deployer
will continue polling until the /health endpoint responds with 200 OK.
Changed import from huggingface_hub.utils to huggingface_hub.errors
to fix mypy errors about implicit re-exports. HfHubHTTPError is
defined in the errors module and should be imported from there directly.
Move HuggingFaceDeployerSettings to flavor module and make
HuggingFaceDeployerConfig inherit from it, following the pattern
used by other deployers in the codebase.

Key changes:
- Move HuggingFaceDeployerSettings from deployer to flavor module
- Make HuggingFaceDeployerConfig inherit from HuggingFaceDeployerSettings
- Remove duplicate space_hardware and space_storage fields from config
- Remove app_port setting and use uvicorn_port from DeploymentSettings
  instead, consistent with docker deployer pattern
- Add comprehensive Field descriptions for settings fields

This addresses PR feedback from @stefannica.
@htahir1 htahir1 requested a review from stefannica November 3, 2025 14:30
claude and others added 2 commits November 3, 2025 14:41
Added debug logging to help diagnose why deployments might get stuck
in pending state even when Space and domain are ready. This will log
the Space stage, domain stage, and domain availability to help
identify any issues with state detection logic.

Also cleaned up redundant domain_stage variable assignments.
claude and others added 8 commits November 5, 2025 08:13
When provisioning a deployment to an existing Space that is STOPPED
or PAUSED, the deployer now automatically restarts it. This ensures
that deployments work correctly even when reusing Spaces that have
been put to sleep by HuggingFace's auto-sleep mechanism.

Key changes:
- Check Space runtime state when updating existing Space
- Call restart_space() API if Space is STOPPED or PAUSED
- Add logging to indicate when Space is being restarted

This fixes the bug where deployments would fail silently when the
target Space was in a sleeping state.
Added logging to show the deployment URL when the Space domain is READY.
This helps diagnose health check issues during polling by showing exactly
what URL the base deployer will attempt to connect to for health checks.
Fixed a critical bug where we were comparing runtime.stage (a string)
directly to SpaceStage enum objects, which always evaluated to False.
This caused deployments to never reach RUNNING status and continue
polling forever.

The HuggingFace API returns runtime.stage as a string (e.g., "RUNNING"),
so we must use enum.value for comparison (e.g., SpaceStage.RUNNING.value).

Changes:
- Use SpaceStage.RUNNING.value instead of SpaceStage.RUNNING
- Use enum.value for all stage comparisons
- Added comments explaining that runtime.stage is a string
- Maintains type safety by using enum values instead of hardcoded strings

This fixes the infinite polling issue where deployments would never
stop polling even when the Space was running and healthy.
Added detailed logging to diagnose two issues:

1. **Unknown status detection**: When the deployment status shows as
   "unknown", we now log the actual stage value received from HuggingFace
   and list all known stages. This helps identify if HuggingFace has
   introduced new stages we don't recognize yet.

2. **Health check failures**: Override _check_deployment_health with
   better logging to show:
   - The exact health check URL being tested
   - Whether health check passed or failed
   - Specific error messages when health check fails
   - Status codes returned from the endpoint

These logs will help diagnose why deployments continue polling even
when the Space appears to be running and the health endpoint is working.

The logs now show stage transitions and health check results at INFO
level for easier debugging without requiring DEBUG logging.
Fixed two critical issues discovered in deployment logs:

1. **Added RUNNING_APP_STARTING stage support**: HuggingFace introduced
   a new intermediate stage "RUNNING_APP_STARTING" that occurs when the
   container is running but the application inside is still starting up.
   Map this to PENDING status since the health endpoint isn't ready yet.

2. **Fixed DeploymentDefaultEndpoints import**: Corrected import path
   from `zenml.enums` to `zenml.config.deployment_settings` to fix
   ImportError that was preventing health checks from running.

These fixes allow deployments to properly transition through all HuggingFace
Space stages and reach RUNNING status once the app is fully started.
Fixed the issue where private HuggingFace Spaces would continuously poll
because the HTTP health check endpoint returns 404 for unauthenticated
requests.

Why this is the right solution:
1. Private Spaces block unauthenticated HTTP requests (returns 404)
2. We already have reliable health state from HuggingFace API
3. When Space stage is RUNNING and domain is READY, we know the
   deployment is genuinely healthy
4. HuggingFace platform handles internal health checks for us

The health check method now simply returns True, relying on the
comprehensive Space runtime state validation we perform in
do_get_deployment_state(). This allows deployments to complete
successfully for both public and private Spaces.
Cleaned up excessive logging that was added for debugging:
- Removed deployment URL log that printed on every poll
- Removed all debug logs showing space state details
- Simplified unknown stage warning message
- Removed health check debug log

Kept important informational logs:
- Creating/updating Space
- Restarting sleeping Spaces
- Unknown stage warnings (simplified)

This significantly reduces log verbosity while maintaining visibility
into key deployment lifecycle events.

Note: HTTP Request logs visible in output are from huggingface_hub
library's own logging, not from our deployer code.
@htahir1 htahir1 merged commit 39f0e07 into develop Nov 5, 2025
71 of 78 checks passed
@htahir1 htahir1 deleted the claude/huggingface-deployer-011CUj51UperMPAytgCheWG1 branch November 5, 2025 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Implementations related to the backend internal To filter out internal PRs and issues run-slow-ci staging-workspace

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants